ABSTRACT


Abstraction is a core principle of Distributional Semantic Models (DSMs) that learn semantic 
representations for words by applying dimensional reduction to statistical redundancies in 
language. Although the posited learning mechanisms vary widely, virtually all DSMs are 
prototype models in that they create a single abstract representation of a word’s meaning. This 
stands in stark contrast to accounts of categorisation that have very much converged on the 
superiority of exemplar models. However, there is a small but growing group of accounts in 
psychology, linguistics, and information retrieval that are exemplar-based semantic models.
These models borrow many of the ideas that have led to the prominence of exemplar models in 
fields such as categorisation. Exemplar-based DSMs posit only an episodic store, not a semantic 
one. Rather than applying abstraction mechanisms at learning, these DSMs posit that semantic 
abstraction is an emergent artifact of retrieval from episodic memory. 


Abstraction is an essential mechanism to learn and rep- 
resent meaning in memory. Theoretical notions of 
abstraction vary across research domains, but tend to 
emphasise aggregation across exemplars to a central 
“average” representation (Reed, 1972), transforming sen- 
sorimotor input to a deeper knowledge representation 
(Barsalou, 1999; Damasio, 1989), or reducing idiosyn- 
cratic dimensions to focus on those attributes most 
common to members of a category (Rosch & Mervis, 
1975). In modern computational models of semantic 
memory, notions of abstraction are formally specified 
and applied to real-world linguistic data to evaluate the 
structure of semantic memory that the mechanisms 
would produce. 

Modern distributional semantic models (DSMs; e.g. 
Landauer & Dumais, 1997) have become immensely 
popular in the cognitive literature due to their success 
at fitting human experimental data, their utility in real- 
world applications, and their insights as models of cogni- 
tion. In general, DSMs learn distributed representations 
for word meanings from statistical redundancies across 
linguistic experience. Because they are often applied to 
text corpora as learning data, DSMs are also referred to 
as “corpus-based” models, although, in principle, their 
learning mechanisms can be applied to covariational 
structure in any dataset (e.g. perception, speech, etc.). 

Despite the wide range of DSMs in the literature, they 
virtually all share the characteristic that they are 
prototype models: They attempt to collapse the entire
set of a word's linguistic exemplars into a single econ- 
omical representation of word meaning. However, this 
practice is in contrast to the literature on categorisation 
that has largely disposed of prototype representations 
in favour of exemplar-based models. In this paper, I
highlight the contradiction between literatures, and attempt
to build a case for exemplar-based models of distribu- 
tional semantics. 

Abstraction is a core mechanistic principle of DSMs. 
Most DSMs apply some form of dimensional reduction 
to words’ experienced linguistic contexts, essentially 
abstracting over the dimensions that are idiosyncratic 
to each context, and converging on the stable 
higher-order dimensions that optimally explain the 
covariational pattern of words across contexts. Aggre- 
gation is also a core principle of virtually all DSMs - 
multiple linguistic contexts are averaged across, 
either explicitly or implicitly, resulting in a single 
central representation for the word that is stored. A 
word's vector pattern across these reduced dimensions 
is thought to represent its generic meaning. Hence, 
each DSM formally specifies an abstraction mechanism 
by which episodic memory is transformed into seman- 
tic memory; in this sense, DSMs embody the idea of 
abstraction and allow us to quantitatively evaluate 
various process explanations of aggregation and 
dimensional reduction. 

The particular mechanisms posited for abstraction in
DSMs differ in several theoretically important ways, and 
include reinforcement learning, probabilistic inference, 
latent induction, and Hebbian learning. Enumerating 
the differences between the mechanisms used by each 
model is beyond the scope of this article (see Jones, 
Willits, & Dennis, 2015 for a review); but all DSMs essen- 
tially specify an abstraction mechanism to formalise the 
classic notion in linguistics that “you shall know a word 
by the company it keeps” (Firth, 1957). The theoretical 
points that follow apply broadly to all DSMs that posit 
abstraction at learning, regardless of specific learning 
mechanism. As two examples of DSMs with very different 
architectures and learning mechanisms, I briefly consider
classic Latent Semantic Analysis (LSA; Landauer & 
Dumais, 1997) and the newest DSM - Google’s 
word2vec (Mikolov, Sutskever, Chen, Corrado, & Dean, 
2013). 

LSA begins with a word-by-document frequency 
matrix of a text corpus. This initial “episodic” matrix rep- 
resents first-order relationships: words are similar if they 
have frequently co-occurred in contexts. LSA then 
applies singular value decomposition to this episodic 
matrix (cf. factor analysis), retaining only the 300 or so
dimensions that account for the largest amount of var- 
iance in the original matrix. Singular value decompo- 
sition serves as an abstraction mechanism, reducing 
the dimensionality and emphasising second-order 
relationships that were not obvious in the episodic 
matrix. In the reduced space, words will be similar if 
they occur in similar contexts, even if they never directly 
co-occur (e.g. category exemplars and synonyms). But 
much of the information idiosyncratic to specific con- 
texts that would be required to reconstruct the full orig- 
inal episodic matrix has now been lost. Hence, LSA 
achieves an abstracted semantic representation by 
applying truncated SVD to the history of episodes. 
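
The reduction step can be written compactly. The block below is a sketch in standard linear-algebra notation rather than the notation of Landauer and Dumais (1997); the symbols X, U, Σ, V, k, and w are my own labels, and scaling the word vectors by Σ_k is one common convention that I assume here.

```latex
% X: m-by-n word-by-document frequency matrix (rows = words, columns = documents)
X = U \Sigma V^{\top}
\quad\Longrightarrow\quad
X_k = U_k \Sigma_k V_k^{\top}, \qquad k \approx 300 \ll \min(m, n)

% Reduced vector for word i (row scaling by \Sigma_k is an assumed convention)
\mathbf{w}_i = \left( U_k \Sigma_k \right)_{i,:}

% Semantic similarity between words i and j in the reduced space
\operatorname{sim}(i, j) =
  \frac{\mathbf{w}_i \cdot \mathbf{w}_j}
       {\lVert \mathbf{w}_i \rVert \, \lVert \mathbf{w}_j \rVert}
```

X_k is the best rank-k approximation of X in the least-squares sense, which is why the discarded dimensions carry exactly the context-specific detail that can no longer be reconstructed.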

Mikolov et al.’s (2013) word2vec achieves a similar 
outcome, albeit in a rather different way. Word2vec is a 
“neural embedding” model that has been extremely suc- 
cessful in computational linguistics. To a cognitive scien- 
tist, the model is essentially a feedforward connectionist 
network (cf. Rumelhart networks explored in Rogers & 
McClelland, 2004) with some optimisation tricks that 
allow it to be scaled up to large amounts of text data. 
Word2vec has localist input and output layers, each 
with one node for each word in the corpus. The input 
and output layers are fully connected via a hidden 
layer of ~300 nodes, which allows the model to learn 
nonlinear patterns in the text corpus. When a word is 
experienced, the other words that it occurs with serve 
as its context. With the nodes for the context words
activated, activation feeds forward to the output layer with
the desired output being the activation of the correct
target word, with other words being inhibited. The 
error signal (difference between true and observed 
output pattern) is then backpropagated through the 
network to increase the likelihood that the correct 
word will be activated at the output layer given the 
input words in the future. Hence, the context is used to 
predict the word. After training on a large text corpus,
a word’s pattern across the hidden layer begins to 
show higher-order relationships that go beyond the 
first-order relationships it was being trained to predict.
Very much like LSA’s reduced representation, the 
reduced representation across word2vec’s hidden layer 
has now learned similarity between words that are pre- 
dicted by similar contexts. While LSA used SVD for data 
reduction, word2vec used backpropagation; but both 
models essentially abstract semantics from episodes. 

These similarities can be seen across all of the DSMs — 
all achieve the desired outcome of a reduced abstraction 
of word meaning from episodic co-occurrences. The jury 
is still out on which (if any) mechanism is the most
plausible model of how humans construct semantic
representations. But one property common to all of these
“abstraction at learning” DSMs is that they may be
classified as prototype models. The models attempt to create a
single abstracted representation of meaning for each 
word, and this single semantic representation is what is 
stored and used in downstream fitting of psycholinguis- 
tic data. 

There are many similarities in the literatures on 
semantic memory and categorisation, enough that it is 
likely that the cognitive mechanisms that subserve 
semantic learning and category learning may be 
heavily related to each other. But one key contradiction 
stands out: While the literature on categorisation has 
very much converged on the superiority of exemplar-based
models, DSMs are all essentially prototype
models.


Lessons from categorisation models


Categorisation and semantic abstraction have many 
similarities, and it is commonly believed that the 
process of categorisation may be used to produce 
semantic structure (see Rogers & McClelland, 2011 for a 
review). The categorisation literature has been domi- 
nated for many years by a debate between prototype 
and exemplar-based theories. Prototype theories are 
based largely on principles of cognitive economy cham- 
pioned by Rosch and Mervis (1975). Prototype theories 
(e.g. Reed, 1972) posit that as category exemplars are 
experienced, humans gradually abstract generalities 
across them and construct a single prototypical
representation of the category that is the central
tendency of its exemplars; categorisation of a new exemplar
depends on its similarity to category prototypes. In con- 
trast, exemplar theories (e.g. Medin & Schaffer, 1978; 
Nosofsky, 1988) posit that humans store every experi- 
enced exemplar in memory, and categorisation of a 
new exemplar depends on its weighted similarity to all 
stored exemplars. 

Perhaps more than any other sub-field of cognition, 
the categorisation literature has very much converged 
on the conclusion that exemplars have beaten out proto- 
types as models of human categorisation (but see 
Murphy, 2016, for a careful discussion of the limitations 
of both). In addition to exemplar models providing a 
better quantitative account of human categorisation 
data, there are many theoretically differentiating effects 
that are easily explainable by exemplar models but that 
are simply impossible under prototype accounts. For 
example, category structures with nonlinearly separable 
structure (e.g. XOR) are easily learned by humans, but 
impossible to account for by single prototype models 
(Ashby & Maddox, 1993; Nosofsky, 1988). Even when 
using linear category structures that should be condu- 
cive to prototype models, exemplar models still give a 
superior quantitative fit to human data (Stanton, 
Nosofsky, & Zaki, 2002). Hence, it is certainly odd that 
the field of distributional semantics is dominated by pro- 
totype models, while the field of categorisation has 
largely dismissed them in favour of exemplar accounts. 
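
The XOR argument can be made concrete with a toy sketch. The code below is illustrative only: the four stimuli, the city-block distance, the exponential similarity gradient, and the sensitivity value `c` are assumptions chosen for the example, not parameters taken from any of the cited studies. It contrasts evidence from a single prototype per category with summed similarity to stored exemplars.

```python
import math

# Toy XOR category structure (hypothetical stimuli with two binary features).
EXEMPLARS = {
    "A": [(0, 0), (1, 1)],
    "B": [(0, 1), (1, 0)],
}


def city_block(x, y):
    """City-block distance between two feature tuples."""
    return sum(abs(a - b) for a, b in zip(x, y))


def prototype(items):
    """Central tendency (feature-wise mean) of a category's exemplars."""
    n = len(items)
    return tuple(sum(column) / n for column in zip(*items))


def prototype_evidence(probe):
    """Distance from the probe to each category's single prototype."""
    return {label: city_block(probe, prototype(items))
            for label, items in EXEMPLARS.items()}


def exemplar_evidence(probe, c=2.0):
    """Summed similarity exp(-c * distance) to every stored exemplar."""
    return {label: sum(math.exp(-c * city_block(probe, item)) for item in items)
            for label, items in EXEMPLARS.items()}


def exemplar_choice(probe, c=2.0):
    """Category with the largest summed similarity."""
    evidence = exemplar_evidence(probe, c)
    return max(evidence, key=evidence.get)
```

In this toy structure both prototypes fall on the same point, (0.5, 0.5), so `prototype_evidence` is tied for every probe, whereas `exemplar_choice` recovers the correct label for all four stimuli.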

In the typical categorisation experiment, subjects are 
presented with stimulus patterns - exemplars - 
accompanied by a category label. At test, the exper- 
imenter can present old or new exemplars, and the 
subject responds with the most appropriate category 
label for each stimulus. We can think of distributional 
learning of semantics in an analogous way: The context 
is the exemplar pattern, and the word is the label of 
the category to which this particular exemplar belongs. 

In word2vec, for example, the other words that occur 
with a target word are used as the context, or exemplar 
pattern, and the correct label is the target word. So in the 
sentence “I am drinking a glass of milk,” drinking + glass
are used as the context to predict milk. The exemplar 
pattern for milk in this context is a localist vector with 
drinking and glass set to one and all other words set to 
zero. Across many language exemplars that are all of 
the category milk, word2vec homes in on a pattern of 
activation across its hidden layer that optimally predicts 
milk as the label given any language exemplar context 
that contains milk. In addition, the hidden layer pattern 
for milk will be very similar to other words that are pre- 
dicted by similar contexts such as juice and wine. So in 
all DSMs, an exemplar can be thought of as the context 
pattern of other words that a target word (the category
label) occurs with. This reframing of semantic learning 
is very similar to current state-of-the-art exemplar- 
based models of categorisation in which “... a stimulus 
is stored in memory as a complete exemplar that 
includes the full combination of other features. Thus 
the ‘context’ for a feature is the other features with 
which it co-occurs.” (Kruschke, 2008, p. 273). However, 
DSMs aggregate over the multiple exemplars to create 
an economical prototype. 
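
A minimal sketch of this reframing follows. Everything in it is hypothetical: the six-word vocabulary, the second sentence, and the helper names are invented for illustration, and the plain average in `aggregate` is only a stand-in for the aggregation that prototype DSMs perform, not word2vec's actual learning rule.

```python
# Hypothetical content-word vocabulary; function words are simply ignored.
VOCAB = ["drinking", "glass", "milk", "juice", "pouring", "cow"]
INDEX = {word: i for i, word in enumerate(VOCAB)}


def localist(context_words):
    """Localist vector: 1 for each context word present, 0 elsewhere."""
    vec = [0] * len(VOCAB)
    for word in context_words:
        vec[INDEX[word]] = 1
    return vec


def exemplars_for(target, sentences):
    """One (context pattern, category label) pair per occurrence of target."""
    out = []
    for sentence in sentences:
        content = [w for w in sentence if w in INDEX]
        if target in content:
            context = [w for w in content if w != target]
            out.append((localist(context), target))
    return out


def aggregate(exemplars):
    """Collapse all exemplars of a label into one averaged pattern."""
    n = len(exemplars)
    return [sum(column) / n for column in zip(*(vec for vec, _ in exemplars))]


SENTENCES = [
    "i am drinking a glass of milk".split(),
    "the cow gives milk".split(),
]
MILK = exemplars_for("milk", SENTENCES)
```

Here `MILK` keeps the two contexts separate (drinking + glass in one, cow in the other), which is the information an exemplar model would store, while `aggregate(MILK)` blends them into a single pattern.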

Hence, most DSMs collapse all instances of a word's 
context into a single representation, or point in high- 
dimensional space, very much consistent with represen- 
tational economy (Rosch, 1973). This process produces 
huge issues in semantic representation that are known 
to the field - for example, a homograph like bank has 
both senses of its meaning collapsed into a single rep- 
resentation, despite the fact that they are very different 
context patterns. As a result, the representation 
becomes a weighted average (biased to the more fre- 
quent sense) of the multiple senses of bank. A homo- 
graph like bank, with multiple unrelated senses, has a 
similar characteristic structure to experimental stimuli 
with XOR structure. But the prototype collapsing is a 
problem for all words with graded amounts of polysemy 
that would be captured by an exemplar-based model but 
are abstracted over by a prototype-based DSM. Multiple 
distinct statistical structures that map onto the same 
label are collapsed in most DSMs, leading to a range of 
both theoretical and practical issues for the models. 
Far from being rare, multiple senses and contextual
modulation patterns are the norm in linguistic
information (Jones, Dye, & Johns, 2016; Kintsch, 2001).


Lessons from multiple-trace models 


Posner and Keele’s (1968) schema abstraction exper- 
iment is a classic in semantic memory research, and 
was a key laboratory phenomenon that led Tulving
(1972) to divide declarative memory into separate
semantic and episodic stores in his modular taxonomy.
In their task, Posner and Keele presented subjects with 
random dot patterns as exemplars of multiple categories. 
Unbeknownst to subjects, the exemplars they experi- 
enced were created from parent prototype patterns for 
each category. A category exemplar was a random per- 
turbation (low or high distortion) of the prototype 
pattern, but prototypes were never shown to subjects 
during learning. There are many interesting effects 
from the schema abstraction task, but a key finding is 
that while subjects at test are best at classifying
exemplars they were trained on, they are better at the
prototype than at new exemplars. In addition, with a delay
between training and test, performance on the
prototype (which was never experienced) is better than
performance on the old exemplars that subjects were
actually trained on. Furthermore, exemplars and proto- 
types follow different trajectories of decay as a function 
of retention time. 

This pattern of results suggests that subjects are 
storing experienced exemplars in episodic memory at 
the same time as they are creating an abstracted proto- 
type for the category. The differential decay patterns also 
suggest that these information sources are stored by dis- 
tinct memory systems, and that semantic memory is 
more resilient to decay than is episodic memory. The 
pattern would seem to argue in favour of DSMs that 
use abstraction at encoding to create a prototypical 
pattern, and episodic memory is then explained by a dis- 
tinct model. 

However, Hintzman (1984; 1986) provided a classic 
demonstration using his MINERVA 2 memory model 
that questioned whether the effects seen in Posner 
and Keele’s (1968) schema abstraction task suggest the 
existence of a prototype in memory at all. Briefly, 
MINERVA 2 is an instance-based memory model: it
stores a pattern for each exemplar in episodic memory, 
but has no semantic memory. Multiple presentations of 
an exemplar simply lay down multiple memory traces. 
The model explains a range of episodic memory effects 
such as recognition, judgments of frequency, etc. But it 
can also perform the classification task used in the 
schema abstraction experiments. When presented with 
a probe pattern (an old or new exemplar) MINERVA 2 
simultaneously computes the probe’s similarity to all 
stored exemplar traces in memory, and the retrieved
category label for the probe is weighted by the scaled
similarity of the probe to all exemplars (cf. Nosofsky’s (1986)
exemplar-based model of categorisation). The retrieved
pattern is referred to as an “echo” from memory, and is 
based loosely on the principle of harmonic resonance. 
Although it has no semantic memory per se, MINERVA 
2 reproduces the key phenomena in schema abstraction 
that had previously been seen as evidence for dual epi- 
sodic and semantic stores. The model performs better 
on old exemplars at immediate test (but better on the 
prototype than new exemplars), and performance on 
the prototype is better than the training exemplars 
after forgetting. Superior performance on the prototype 
is due simply to the fact that it is the central tendency of 
the exemplar patterns; hence the prototype’s pattern is 
distributed across the exemplars. The performance tra- 
jectories of exemplars and the prototype as a function 
of delay have distinct slopes. 
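
The retrieval computation can be sketched in a few lines. The code follows the usual description of MINERVA 2 (features coded as -1, 0, or +1; similarity normalised by the number of relevant features; activation as the cube of similarity), but the eight-feature prototype, the one-feature distortions, and the function names are assumptions made for the example. The sketch omits category labels and forgetting, so it does not reproduce the delay effects.

```python
def similarity(probe, trace):
    """Dot product divided by the number of features nonzero in either vector."""
    relevant = sum(1 for p, t in zip(probe, trace) if p != 0 or t != 0)
    if relevant == 0:
        return 0.0
    return sum(p * t for p, t in zip(probe, trace)) / relevant


def activations(probe, memory):
    """Nonlinear activation of every trace: similarity cubed (sign preserved)."""
    return [similarity(probe, trace) ** 3 for trace in memory]


def echo_intensity(probe, memory):
    """Summed activation across all traces."""
    return sum(activations(probe, memory))


def echo_content(probe, memory):
    """Activation-weighted sum of all traces, feature by feature."""
    acts = activations(probe, memory)
    return [sum(a * trace[j] for a, trace in zip(acts, memory))
            for j in range(len(probe))]


def distort(pattern, position):
    """Hypothetical distortion: flip the sign of a single feature."""
    out = list(pattern)
    out[position] = -out[position]
    return out


def sign(x):
    return (x > 0) - (x < 0)


PROTOTYPE = [1, -1, 1, 1, -1, 1, -1, -1]
# Memory holds only distorted exemplars; the prototype itself is never stored.
MEMORY = [distort(PROTOTYPE, i) for i in range(4)]
NEW_EXEMPLAR = distort(PROTOTYPE, 4)
```

With these toy patterns the never-stored prototype produces an echo intensity of 1.6875 against 0.5 for the new distortion, and the echo content retrieved by the new distortion has the same sign pattern as the prototype, even though no trace in memory matches it.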

Hintzman’s (1984; 1986) demonstration is well 
covered in most contemporary memory textbooks - it
is an elegant existence proof that phenomena used to
argue for the existence of semantic memory may actually 
be due to the process of retrieval from episodic memory. 
In the interest of parsimony, there may be no need to 
posit an additional semantic store when a model that 
has only an episodic store can produce all the phenom- 
ena that a model with two distinct stores could. This 
claim bears considerable similarity to other instance- 
based models of memory and exemplar-based models 
of categorisation. So why mention historical cases like 
MINERVA 2 and schema abstraction here? Because one 
of the first successful exemplar-based DSMs in cognitive 
science extends MINERVA 2’s architecture exactly to a 
text corpus, and makes the same theoretical claims. 


Exemplar-based semantic models 


While it is true that most DSMs are prototype models, 
there is a small family of exemplar-based semantic 
models that diverge from the usual quest for cognitive 
economy. Exemplar-based semantic models are also 
referred to as “retrieval-based” models in the cognitive 
literature or simply as “memory models” in compu- 
tational linguistics. Rather than positing abstraction as 
a dimensional reduction mechanism at learning, they 
store all of a word’s episodic contexts, and abstraction 
is a consequence of retrieval from episodic memory. 
Hence, there is no semantic memory per se in these 
models, only episodic memory. In exemplar-based 
models, phenomena that have typically been attributed 
to semantic memory are an emergent artifact of retrieval 
from episodic memory. The locus of semantics is not at 
encoding, but at retrieval. These models have grown 
from exemplar-based models in categorisation, and 
instance-based models in memory. Intuitively, many 
people believe the idea that we store everything we 
ever experience rather than creating and storing an 
economical abstraction is far-fetched. But given the 
success of exemplar-based semantic models at account- 
ing for an impressive array of semantic behaviours 
without any semantic memory, and the current resur- 
gence of usage-based theories in linguistics (Goldberg, 
2006; Johns & Jones, 2015; Tomasello, 2003), exemplar- 
based semantic models deserve a closer look. 

Kwantes (2005) extended Hintzman’s (1986) MINERVA 
2 to explain semantic phenomena with words by training 
it on a text corpus. In his Constructed Semantics Model 
(CSM), each word’s representation in memory is a
binary vector that reflects whether it occurred in a docu- 
ment or not — its episodic history. Note that memory in 
CSM is the same word-by-document matrix that LSA 
and other DSMs learn from. But where LSA applies 
abstraction to this episodic matrix and stores a
higher-order representation, CSM stores the episodic matrix
itself. When a word is presented to CSM, its episodic 
vector is used as a probe as in MINERVA 2. Each word 
in memory is activated relative to its contextual overlap 
with the probe word, and the echo pattern is then the 
similarity-weighted sum of all traces in memory, exactly 
as in Hintzman (1986). Words that have similar contex- 
tual histories to the probe word will contribute more of 
their pattern to the echo than will words with rather 
independent histories, and the echo for the word is 
then an ad hoc, and probe specific, prototype created 
by this process of retrieval from episodic memory.
To compute the semantic similarity between two 
words, one simply computes the cosine between their 
two echo patterns. 

It is fairly obvious how CSM can determine similarity 
for words that frequently co-occur with each other 
(their echo cosines would be a noisy amplification of 
the terms’ likelihood of co-occurrence relative to 
chance). But higher-order semantic similarities also 
emerge from this process of retrieval, even between 
two words that have zero contextual overlap. For 
example, two synonyms would not activate each other 
at all because they have never co-occurred in the text 
corpus, but they would activate many of the same 
other words due to their similar contextual usage; as a 
result, their retrieved echo patterns are extremely 
similar. Models such as LSA and word2vec accomplish 
this second-order statistical inference while learning a 
corpus, whereas CSM does it while retrieving information 
from episodic memory. 
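
A toy sketch of retrieval-based second-order similarity is given below. It is not Kwantes’ (2005) implementation: the six-word, six-document matrix is invented, and using the plain cosine as the activation function (with no threshold or nonlinearity) is a simplifying assumption.

```python
import math

# Hypothetical word-by-document matrix; each row is a word's episodic trace
# (1 = the word occurred in that document).
MEMORY = {
    "car":        [1, 1, 0, 0, 0, 0],
    "automobile": [0, 0, 1, 1, 0, 0],
    "engine":     [1, 1, 1, 0, 0, 0],
    "wheel":      [1, 0, 1, 1, 0, 0],
    "banana":     [0, 0, 0, 0, 1, 1],
    "fruit":      [0, 0, 0, 0, 1, 1],
}


def cosine(x, y):
    """Cosine between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def echo(word):
    """Similarity-weighted sum of all traces, probed with the word's own trace."""
    probe = MEMORY[word]
    result = [0.0] * len(probe)
    for trace in MEMORY.values():
        activation = cosine(probe, trace)  # assumed activation function
        for j, value in enumerate(trace):
            result[j] += activation * value
    return result


def semantic_similarity(word_a, word_b):
    """Cosine between the two words' retrieved echoes."""
    return cosine(echo(word_a), echo(word_b))
```

In this matrix car and automobile never share a document, so the cosine of their raw traces is zero, yet both activate engine and wheel and their echoes end up with a cosine of roughly 0.70. The echoes of car and banana remain orthogonal.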

Hence, semantic abstraction in CSM is a parallel to 
schema abstraction in MINERVA 2: the prototype is an 
emergent property of retrieval from episodic memory. 
As Kwantes (2005) puts it, CSM “... takes what it knows 
about a word’s contexts and uses retrieval to estimate 
what other context might also contain the word” 
(p. 706). The model bears obvious similarity in outcome 
to prototype-based DSMs, but it differs considerably in 
the psychological mechanism that it attributes abstrac- 
tion to; in CSM, it is the well-established process of retrie- 
val that uncovers deeper semantic structure. 

As with Hintzman’s (1986) demonstration, CSM is a 
more parsimonious model of semantics — it does not 
require two separate stores or processes to explain 
semantic and episodic memory, and serves as an exist- 
ence proof that semantic phenomena may be explained 
by a model that only posits an episodic memory store. In 
addition, the success of the model is reinforced by con- 
verging evidence supporting exemplar-based models 
in the fields of categorisation and recognition. Further- 
more, there are real benefits to CSM that allow it to 
handle phenomena not possible by abstractionist 
DSMs. For example, it can handle polysemous words
because the multiple senses of the words are still rep- 
resented and are dissociable with nonlinear activation 
of exemplars (cf. Nosofsky, 1986). Memory traces 
whose context fits one or the other sense of a word 
can be differentially activated in CSM. Abstractionist 
DSMs, on the other hand, collapse multiple senses of a 
word to a single point in high-dimensional space, 
losing the distinction in favour of an averaged 
representation.
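
The contrast can be illustrated with a deliberately small example. The four context features, the four stored contexts for bank, and the cubed-cosine activation are all assumptions made for illustration; they are not taken from CSM or any other cited model.

```python
import math

FEATURES = ["money", "loan", "river", "water"]

# Hypothetical stored contexts for the homograph "bank":
# three financial-sense exemplars and one river-sense exemplar.
BANK_EXEMPLARS = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
]


def cosine(x, y):
    """Cosine between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def prototype(exemplars):
    """Abstractionist stand-in: one averaged vector for the word."""
    n = len(exemplars)
    return [sum(column) / n for column in zip(*exemplars)]


def retrieve(probe, exemplars, power=3):
    """Exemplar stand-in: sum of traces weighted by cosine ** power."""
    out = [0.0] * len(probe)
    for trace in exemplars:
        activation = cosine(probe, trace) ** power
        for j, value in enumerate(trace):
            out[j] += activation * value
    return out


RIVER_CUE = [0, 0, 1, 0]  # a context containing only "river"
```

`prototype(BANK_EXEMPLARS)` gives [0.5, 0.5, 0.25, 0.25], a blend weighted toward the more frequent financial sense, whereas `retrieve(RIVER_CUE, BANK_EXEMPLARS)` activates only the river-sense trace and returns zero on the money and loan features.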

Kwantes’ (2005) work suggests that the same basic
memory system could underlie both episodic and 
semantic knowledge, and his work has given rise to a 
handful of other models that have explored semantic 
abstraction as a memory retrieval operation rather than 
a learning mechanism. For example, Dennis (2005; see 
also Thiessen, 2017) presented a memory-based model
of verbal processing, including semantic and syntactic
information, as retrieval from long-term memory and
constraint satisfaction in working memory. The model
mechanisms are based on a Bayesian interpretation of 
string edit theory from linguistics. Dennis’ model posits 
that processing a word or sentence is at its core a 
memory-retrieval process. 

Johns and Jones (2014, 2015; see also Thiessen & 
Pavlik, 2013) extended this previous work into an exem- 
plar-based model, based on a hybrid of Hintzman’s 
MINERVA 2 (1986) and Jones and Mewhort’s (2007) 
BEAGLE architectures, that encodes sentences from a 
natural language text corpus into individual memory 
exemplars. The retrieval mechanism is used to generate 
expectancies about the future structure of sentences, 
much in the same way as Kwantes (2005) constructs a 
word’s meaning as a prediction of the future contexts 
in which it might occur. Johns and Jones found that 
such an exemplar-based model successfully accounted 
for a wide range of sentence processing tasks that had 
commonly been seen as evidence for rule-based abstrac- 
tion of linguistic constraints. Johns, Jamieson, Crump, 
Jones, and Mewhort (2016) extended this model to 
demonstrate that rule-based grammatical behaviour is 
a natural emergent property of retrieval from a model 
that stores exemplars of linguistic experience. Hence, 
both semantics and syntax may very well be constructed 
properties of retrieval from episodic memory rather than 
abstracted structures or rules, per se.


Exemplar-based models in natural language 
processing 


It is tempting to think of exemplar-based models as a 
psychology-centric theory with little, if any, practical
significance. After all, why would a computing scientist
want to store all data instances? Data compression and
abstraction are core goals of information retrieval
applications. However, exemplar models are now seeing
considerable use in natural language processing (NLP) as
well, for the very same reasons that they are preferred 
in categorisation: the affordance of nonlinear activation 
of memory exemplars given a probe. 

A classic example in NLP was presented by Daele- 
mans, Van Den Bosch, and Zavrel (1999), showing that 
abstractionist models lose exceptions to common pat- 
terns in a variety of language processing tasks. Prototype 
models offer the best single representation, but the dis- 
tribution of meanings and usage rules is heavily skewed 
in natural languages — prototype models discard the tail 
(cf. Johns & Jones, 2010 in lexical semantics). Daelemans
et al. found that retaining exceptional training instances 
in memory was actually beneficial for generalisation 
accuracy across a wide range of common NLP tagging 
tasks, and they argue that the field needs to take exem- 
plar-based memory models much more seriously. 

More recently, exemplar-based models have seen a 
resurgence in NLP, offering better accuracy on applied 
problems that the field had been deadlocked on with
abstractionist models. For example, Erk and Padó
(2010) used an exemplar-based memory model in 
which separate exemplars were encoded for words and 
sentences (cf. Dennis, 2005; Johns & Jones, 2015). They 
found superior performance on a practical paraphrase 
task using this architecture due to the nonlinear acti- 
vation of related exemplars - this behaviour allowed 
the model to “ignore” the exemplars that were other 
senses of a target word, which would have been a col- 
lapsed noise source in an abstractionist model. In fact, 
their exemplar model outperformed all then state-of- 
the-art paraphrasing models, and has considerable simi- 
larity to exemplar-based memory models in cognitive 
science (e.g. Thiessen & Pavlik, 2013). 

However, an issue with the application of exemplar- 
based models to applied NLP tasks will always be proces- 
sing time. Exemplar-based models are fast to train, but 
require substantial and even computationally impractical 
memory resources, and are slow to retrieve the correct 
answer. In contrast, prototype models embody data 
compression, putting all the time into training the 
single best representation, but then the search time for 
a similar instance in memory is much more efficient. In 
applied problems, such as information retrieval, access 
time is everything. However, there have been many suc- 
cessful hybrid models emerging in NLP that balance 
accuracy with generalisation and speed. Multiple proto- 
type models (e.g. Reisinger & Mooney, 2010) have 
become popular to represent the distinct senses of a 
word without needing to store all exemplars, and are
quite similar to multiple prototype theories of
categorisation (Minda & Smith, 2001). Similarly, there has been
considerable success in NLP with models that represent 
words as regions, rather than points, in distributional 
space (Erk, 2009; Vilnis & McCallum, 2015). These 
models preserve nonlinear activation of exemplars
while embedding them in a more reasonable search
space with attractor basins. The practice has a similar
outcome to setting a threshold on the similarity function 
in exemplar-based psychological models to reduce the 
activation of irrelevant items (which is precisely what
Kwantes’ (2005) model does). This also suggests
considerable potential for the application of hybrid
rule-and-exception models from human category learning (e.g.
Nosofsky, Palmeri, & McKinley, 1994). 

Also of interest in practical NLP applications is the 
recent rise of so-called memory networks (Weston, 
Chopra, & Bordes, 2015) that have proven very successful 
at open question answering with complex real-world text 
materials. Memory networks use a long-term exemplar 
memory network as a dynamic knowledge base, and 
have produced state-of-the-art results with difficult 
tasks such as question answering, summarisation, and 
text-based inference (Bordes, Usunier, Chopra, & 
Weston, 2015). 


Discussion 


Meaning is a fundamental human attribute that perme- 
ates all cognition, from low-level perceptual processing 
to high-level problem solving, and everything in 
between. Semantic abstraction is what makes us a 
powerful species — informavores. The idea that humans 
construct and store abstracted semantic representations 
for concepts is almost sacred in cognitive science. But it 
is also at odds with conclusions from other areas of cog- 
nition, such as categorisation and recognition, which pre- 
sumably tap aspects of the same cognitive mechanism as 
semantic learning. And exemplar-based DSMs suggest 
that, like Hintzman’s (1986) demonstration, we might 
be able to explain all the same semantic phenomena 
without a semantic memory. According to exemplar- 
based DSMs, semantic memory is a process, not a 
structure. 

It is tempting to see exemplar-based DSMs as “cheat- 
ing:” If the model simply stores all data, then it can 
compute an accurate semantic representation whenever 
one is needed. But the theoretical claim is profound - it is 
a frightening proposal that we may not actually have 
semantic memory. Your interpretation of the words 
you are reading right now may be constructed on the 
fly as an artifact of retrieving the visual patterns from episodic
memory. Our phenomenology of meaning may be
continuously constructed as the interaction between
stimuli, episodic memory, and the memory retrieval 
mechanism that mediates them (Kintsch & Mangalath, 
2011). But exemplar-based DSMs should also put us at 
ease — they provide converging evidence that perform- 
ance across multiple cognitive domains (e.g. categoris- 
ation, recognition, semantics) may be explained by the 
same unified cognitive principle. Exploring exemplar- 
based DSMs also has practical considerations for edu- 
cation, where exemplar-based models of categorisation 
have been successfully applied (Norman, Young, & 
Brooks, 2007; Nosofsky, 2017). And since humans are 
both the producers and consumers of linguistic infor- 
mation in all practical NLP tasks, it is also reassuring 
that the recent findings in NLP may suggest that the 
best models to serve humans in these tasks bear con- 
siderable similarity to the models we believe human cog- 
nition has evolved to use. 

How might exemplar-based semantic models be 
implemented in neural hardware? A common criticism 
against exemplar theory is that it is stranded at the computational
level of Marr’s hierarchy and that the transition to
implementation is untenable. The claim that we simply
store everything that we experience seems unintuitive 
and goes against the core principle of cognitive 
economy. However, Hintzman (1990) has shown how 
an exemplar-based memory model such as MINERVA 2 
can be easily implemented within a neural network fra- 
mework. In addition, there is a small body of work 
attempting to understand and formalise biologically 
plausible exemplar theories of recognition and categoris- 
ation, which have typically pointed to a role for the hip- 
pocampus and surrounding medial temporal lobe 
structures (e.g. Pickering, 1997). Furthermore, Becker's
(2005) models cleanly demonstrate that hippocampal 
coding would give rise to distinct memory represen- 
tations for highly similar items. 

More recent work in categorisation is now focusing on 
the basal ganglia and striatum as giving rise to the oper- 
ations needed for exemplar models (see Ashby & Rose- 
dahl, 2017, for a review). Ashby and Rosedahl recently 
introduced a neural implementation of exemplar 
theory, in which a key role for the formation of category 
exemplars is assigned to synaptic plasticity at cortical- 
striatal synapses. Rather than storing strict exemplars, 
per se, their model adds nodes and manipulates connec- 
tivity between striatal and sensory neurons, achieving 
the same effect as classic exemplar models. Ashby and 
Rosedahl show that their neural implementation of 
exemplar theory is mathematically equivalent to classic 
exemplar theories such as the Generalized Context Model
(Nosofsky, 1986), and makes identical predictions. The 
work of Ashby and Rosedahl establishes an important
equivalence between classic exemplar-based models
and neural exemplar theories. Not only do exemplar the- 
ories provide superior quantitative fits, but increasing 
biological plausibility also extends predictions to find- 
ings from cognitive neuroscience (e.g. Ashby & Valentin, 
2016; Hélie, Paul, & Ashby, 2012; Valentin, Maddox, & 
Ashby, 2014). 

The predominance of prototype-based DSMs in the 
literature may be partially due to Chomskian presump- 
tions in linguistics that the job of the cognitive mechan- 
ism is to abstract the rules of a grammar from instances. 
This abstractionist presumption may have implicitly 
guided architectural decisions in early DSMs. In addition, 
the notion of cognitive economy (Rosch & Mervis, 1975) 
was a guiding principle to models of semantic abstrac- 
tion. However, both of these theoretical presumptions 
are currently being revisited given the strength of 
usage-based theories in linguistics (Tomasello, 2003). 
But a more likely reason for the preference for prototype-
over exemplar-based DSMs in practice is that exemplar
models are much more computationally expensive 
than prototype models. The front-end data compression 
core to prototype DSMs means that they require far less 
memory to store, and are far more efficient to use, than
exemplar models.

Development of DSMs in general has benefited from 
cross-disciplinary interactions with applied fields, such 
as information retrieval. Put simply, these models are 
both theoretically informative to cognitive science, and 
useful for practical NLP tasks. However, the utility of 
the model should not constrain its theoretical informa- 
tiveness. Prototype DSMs are needed because they 
provide an efficient and economical estimate of a 
word’s aggregate meaning. Exemplar-based DSMs 
contain more information, but the retrieval problem 
becomes intractable and untenable for practical tasks. 
Nobody wants to type a search query into Google and 
have it determine what you mean by activating and 
weighting exemplars in real time; the time-intensive 
computation should have been completed and stored 
long before you type in the query words. 

But the constraint of utility in NLP may have had the 
unwanted effect of guiding our theoretical models of 
the mind away from exemplar models. There are many 
differences between the brain and computational data- 
bases in how they represent and retrieve information. 
The search and abstraction processes used in cognition 
need not be identical to those best for database 
search. Models of cognition have long assumed that 
memory exemplars can be activated in parallel, although 
the code we use to implement this in a model will usually 
use a loop routine. This is a distinct difference between 
the two disciplines: Looping through all exemplars is 
not an efficient method of, for example, word similarity
matching, but it may well be the correct model of how 
humans do it. The constraints of our current compu- 
tational hardware should not be used as reasons to 
discard otherwise superior fitting models of human cog- 
nition, such as exemplar models. 
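The point that a loop routine is only a serial stand-in for parallel activation can be sketched as follows. The activation rule follows the MINERVA 2 convention (Hintzman, 1986) of cubing a normalised dot product; the toy traces, the function names, and the use of a thread pool are assumptions made for illustration, and the thread pool demonstrates independence of the computations rather than any real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def activation(probe, trace):
    """MINERVA 2-style activation: cubed normalised dot product."""
    relevant = sum(1 for p, t in zip(probe, trace) if p != 0 or t != 0)
    if relevant == 0:
        return 0.0
    similarity = sum(p * t for p, t in zip(probe, trace)) / relevant
    return similarity ** 3


def activations_serial(probe, traces):
    """The usual loop routine: visit each exemplar in turn."""
    result = []
    for trace in traces:
        result.append(activation(probe, trace))
    return result


def activations_concurrent(probe, traces, workers=4):
    """Each exemplar's activation depends only on the probe and that
    exemplar, so the same values can be computed independently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda trace: activation(probe, trace), traces))


def echo(traces, acts):
    """Echo content: activation-weighted sum of all traces."""
    width = len(traces[0])
    return [sum(a * trace[j] for a, trace in zip(acts, traces))
            for j in range(width)]


# Toy episodic store with features in {-1, 0, +1} (illustrative values).
traces = [[1, -1, 1, 0], [1, 1, -1, 1], [-1, -1, 1, 1]]
probe = [1, -1, 1, 0]

serial = activations_serial(probe, traces)
concurrent = activations_concurrent(probe, traces)
```

Because no exemplar's activation depends on any other's, the serial and concurrent routines return the same values; the loop is a property of the simulation code, not of the theory being simulated.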


Notes 


1. LSA and word2vec are formally equivalent: Levy and 
Goldberg (2014) demonstrated analytically how the 
SGNS architecture of word2vec is implicitly factorizing 
a word-by-context matrix whose cell values are shifted 
PMI values. 

2. The model's direction can also be inverted, using the 
word to predict the context (SGNS) rather than using 
the context to predict the word (CBOW). 

3. The bulk of the evidence used by Tulving to argue for dis- 
tinct semantic and episodic memory systems was from 
neuropsychological patients. 

4. The model is simply referred to as the “semantics model” 
in Kwantes’ (2005) original paper, but “Constructed 
Semantics Model” has become its popular name
among semantic modelers because semantic represen- 
tations are constructed on the fly from episodic 
memory in the model. 

5. An exception here is the topic model, which uses conditional
probabilities, so it is not subject to the metric restrictions
of spatial models (e.g. Griffiths, Steyvers, &
Tenenbaum, 2007). 

6. And essentially the same architecture has been used by 
Goldinger (1998) to explain “abstract” qualities of spoken 
word representation from episodic memory retrieval. 
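
The relation described in Note 1 can be written compactly. In the notation assumed here (not taken from the text above), \vec{w} and \vec{c} are the learned word and context vectors and k is the number of negative samples; the identity holds at the optimum of the SGNS objective when the embedding dimensionality is large enough to reconstruct the matrix exactly.

```latex
\vec{w} \cdot \vec{c} \;=\; \mathrm{PMI}(w, c) - \log k,
\qquad
\mathrm{PMI}(w, c) \;=\; \log \frac{P(w, c)}{P(w)\,P(c)}
```

With lower-dimensional embeddings the factorisation is only approximate, which is the sense in which SGNS "implicitly" factorises the shifted PMI matrix.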